by Julie Olin, Mik Lokdam and Julian Roin Skovhus
## Basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
from PIL import Image
import seaborn as sns
import json
import branca.colormap as cm
import geopandas as gpd
import warnings
## Interactive maps
from bokeh.io import output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, ranges, Legend
from bokeh.palettes import Category20
from folium.plugins import HeatMap
output_notebook()
import math
import folium
## Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
warnings.filterwarnings('ignore')
colors = sns.color_palette("plasma")
In this project our study area is confined to New York, the Big Apple, the City that Never Sleeps. By looking at four different datasets (summarized in the table below), we wish to create an interactive guide that can help tourists choose where to stay depending on their needs and wants.
Besides information on the AirBnb listings, we also include information on crime, noise complaints and cafes/restaurants. One could include many other data sets to describe a city, but we considered these to be key contributors to the niceness of a neighborhood. Additionally, the included data sets needed to have spatial dimensions, which limited the number of usable data sets significantly.
With more than 45,000 AirBnb listings in New York in 2019, there are plenty of places to choose from, and given that more than 19 million people live in the metropolitan area, the diversity is high and districts have their own 'spark'. To help tourists choose the right district without having to read a myriad of blogs, we have created this visual guide based on exploratory and explanatory data analysis.
| Name | Used features | Size | Year | Source |
|---|---|---|---|---|
| AirBnb | Id, lat, lon, price, availability, neighborhood, reviews, hosts etc. | 7MB | 2019 | Kaggle |
| Crime data | Id, lat, lon | 1.7GB | 2017-18 | Kaggle |
| Noise complaints | Id, lat, lon | 53MB | 2016 | Kaggle |
| Restaurants | Id, lat, lon | 139MB | 2018 | Kaggle |
We'll briefly explain the attributes in the AirBnb data set as they are essential for understanding the following analysis.
In this section we will perform an exploratory data analysis of the AirBnb data set. We will mainly focus on investigating price, location and hosts. Towards the end, we will introduce the other data sets.
For now, we will only load the AirBnb data set and a shp-file containing the neighborhoods of Manhattan.
df_bnb = pd.read_csv('Air_bnb/AB_NYC_2019.csv')
fname = 'manhat_last.shp'
nil = gpd.read_file(fname)
Let's start by inspecting the first few rows of the data set.
df_bnb.head(10)
We can see that if no reviews of a listing have been given, the attributes 'last_review' and 'reviews_per_month' are set to NaN. We will simply set them to zero instead.
df_bnb.fillna({'reviews_per_month': 0, 'last_review': 0}, inplace=True)
Let's try to identify other NaN values within the data.
df_bnb.isna().sum()
We see that only the listing name and the host name are missing. Neither is essential for our analysis, so we will keep them as NaNs.
Initially, we look at the five large boroughs of New York. Let's see how the AirBnbs are distributed across the boroughs.
## Amount across boroughs
df_borough_count = df_bnb.groupby('neighbourhood_group').count()
df_borough_count['id'] = df_borough_count['id']/len(df_bnb)*100  # share of all listings, in percent
fig, ax = plt.subplots(figsize=(18,8))
(df_borough_count.sort_values(by = 'id', ascending = False))['id'].plot(kind = 'bar', ax = ax, rot = 0, color=colors[3])
ax.set_xlabel('Borough')
ax.set_ylabel('Air Bnb listings [%]')
plt.show()
The majority of listings are in Manhattan and Brooklyn. Let us take a quick look at the price distributions across the boroughs. To this end, we will use a violin plot, which is similar to a boxplot but also includes a probability density at different prices. This helps us identify the several peaks in prices that we expect.
## Price across boroughs
fig, ax = plt.subplots(figsize=(18, 8))
sns.set_theme(style="whitegrid")
v2=sns.violinplot(ax = ax, data=df_bnb[df_bnb['price']<500], x='neighbourhood_group', y='price',
palette = 'plasma', order=[ "Manhattan", "Brooklyn", "Queens", 'Bronx', 'Staten Island'])
v2.set_title('Density and distribution of prices for each borough')
ax.set(xlabel='Borough', ylabel='Price pr. night [$]')
plt.show()
It appears that Manhattan has a higher median price, around 75\$, than the other four boroughs. This was expected, as Manhattan is centrally located and known for its expensive housing prices, which would affect rental prices as well. For Brooklyn and Manhattan we can identify the small bulges as round prices (100\$, 150\$, 200\$ etc.). We have only considered rentals below 500\$ as there are extremely expensive rentals that would make the figure hard to read.
We focus the remainder of the analysis on Manhattan. This is because it is the most famous borough in New York and we want the analysis to help people who want the real New York experience. It will also allow more detailed analysis of specific neighborhoods. The downside is of course loss of data, but we still have around 22,000 rows.
Let us consider the distribution of AirBnbs across the neighborhoods in Manhattan.
## Basic stats
df_manhattan = df_bnb[df_bnb['neighbourhood_group'] == 'Manhattan'].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
df_manhattan_count = df_manhattan.groupby('neighbourhood').count()
fig, ax = plt.subplots(figsize=(16,7))
(df_manhattan_count.sort_values(by = 'id', ascending = False))['id'].plot(kind = 'bar', ax = ax, rot = 60, color = colors[3])
ax.set(xlabel='Neighbourhood', ylabel='No of AirBnb listings')
plt.show()
Harlem has by far the most listings, followed by Upper West Side and Hell's Kitchen. In the plot, Stuyvesant Town and Marble Hill appear to have no listings, but they do have 37 and 12 respectively.
Let's take a look at the price distribution across all neighbourhoods.
fig, ax = plt.subplots(figsize=(16,7))
(df_manhattan['price'].plot(kind = 'hist', ax = ax, rot = 60, color = colors[3], bins = 100))
plt.show()
It appears that the histogram is "drawn out" by high prices of around 10,000\$ pr. night! Let's inspect the most expensive AirBnb listings in Manhattan.
df_manhattan[df_manhattan['price']>5000]
Words such as luxury, spacious, yacht, penthouse and >3000 sq.ft. all appear in the names of the apartments, which maybe justifies the astronomical prices. The private rooms in Lower East Side and Upper West Side and the apartments with a price pr. night of 9999\$ are almost certainly mistakes by the hosts, but we cannot know for sure, so we will keep them in the data set. Although the most expensive apartments are interesting, we get more information from the histogram by investigating apartments below a threshold, as for the violin plots earlier.
fig, ax = plt.subplots(figsize=(18,8))
(df_manhattan[df_manhattan['price']<1000]['price'].plot(kind = 'hist', ax = ax, rot = 60, color = colors[3], bins = 100))
ax.set_xticks(np.arange(0,1001,50))
ax.set_xlabel('Price pr. night [$]')
plt.show()
We see that the shape looks like a right-skewed normal distribution, with most listings priced around 150\$ pr. night. When looking at mean price pr. night across neighborhoods, we only consider rentals below 1000\$ pr. night. The reasons are that 1) the few, very expensive apartments skew the prices, especially in neighborhoods with few listings, and 2) this is an analysis for the people, not the filthy rich. The downside is loss of data, but it is only around 172 listings.
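The size of the discarded tail can be checked with a simple boolean sum; here is a minimal sketch with made-up prices (the real df_manhattan['price'] column would be filtered the same way):

```python
import pandas as pd

# Made-up prices -- a stand-in for the real price column
df = pd.DataFrame({'price': [120, 250, 1500, 9999, 80]})
n_excluded = (df['price'] >= 1000).sum()   # listings the cutoff throws away
n_kept = (df['price'] < 1000).sum()        # listings that remain
print(n_excluded, n_kept)  # 2 3
```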
df_manhattan_price = df_manhattan[df_manhattan['price']<1000].groupby('neighbourhood').mean(numeric_only=True)
fig, ax = plt.subplots(figsize=(18,8))
(df_manhattan_price.sort_values(by = 'price', ascending = False))['price'].plot(kind = 'bar', ax = ax, rot = 70, color = colors[3])
ax.set_ylabel('Mean price pr. night [$]')
ax.set_xlabel('Neighborhood')
plt.show()
We see that Tribeca is well ahead in mean price, followed by NoHo, Flatiron District and Midtown. Later we will look into the spatial influence on the price, but first we will look into rental types, availability and reviews, and introduce the other data sets that we will use for the analysis.
We study the avg. price pr. night across the three room types in an interactive plot with Bokeh. We first group by 'room_type' and 'neighbourhood' and then convert the dataframe into the format required by Bokeh. Some neighborhoods have no shared rooms, hence we have to manually add a row with this information.
## Computing the avg. price pr. room type for all neighborhoods
df_manhattan_avgprice = df_manhattan[['neighbourhood', 'room_type', 'price']]
## Neighborhoods without shared rooms get a dummy row so every group exists
df_add = pd.DataFrame([['Civic Center', 'Shared room', 0], ['Flatiron District', 'Shared room', 0], ['Marble Hill', 'Shared room', 0],
                       ['NoHo', 'Shared room', 0], ['Tribeca', 'Shared room', 0], ['Two Bridges', 'Shared room', 0]],
                      columns=['neighbourhood', 'room_type', 'price'])
## DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_manhattan_avgprice = pd.concat([df_manhattan_avgprice, df_add], ignore_index=True)
df_manhattan_avgprice = df_manhattan_avgprice.groupby(['room_type', 'neighbourhood']).mean()
## Reformatting
df_plot = pd.DataFrame()
room_type = ['Entire home/apt', 'Private room', 'Shared room']
for i in room_type:
    df_plot[i] = np.ravel(df_manhattan_avgprice.loc[i].values)
df_plot['district'] = ['Battery Park City', 'Chelsea', 'Chinatown', 'Civic Center',
'East Harlem', 'East Village', 'Financial District',
'Flatiron District', 'Gramercy', 'Greenwich Village', 'Harlem',
"Hell's Kitchen", 'Inwood', 'Kips Bay', 'Little Italy',
'Lower East Side', 'Marble Hill', 'Midtown', 'Morningside Heights',
'Murray Hill', 'NoHo', 'Nolita', 'Roosevelt Island', 'SoHo',
'Stuyvesant Town', 'Theater District', 'Tribeca', 'Two Bridges',
'Upper East Side', 'Upper West Side', 'Washington Heights',
'West Village']
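As an aside, the manual dummy rows and reshaping above can also be done in one step with pandas' `pivot_table`, which creates one column per room type and fills missing combinations directly. A minimal sketch with toy data (the column names match the real data set, the values do not):

```python
import pandas as pd

# Toy data: the (NoHo, 'Entire home/apt') combination is deliberately missing
df = pd.DataFrame({
    'neighbourhood': ['Tribeca', 'Tribeca', 'NoHo'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room'],
    'price': [200, 400, 150],
})
# One column per room type, missing combinations filled with 0 --
# the wide shape that ColumnDataSource expects
df_plot = df.pivot_table(index='neighbourhood', columns='room_type',
                         values='price', aggfunc='mean', fill_value=0)
print(df_plot)
```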
The cell below produces the plot. We will not split it further for additional explanation.
cols = ['firebrick', 'mediumblue', 'gold']
room_type = ['Entire home/apt', 'Private room', 'Shared room']
source = ColumnDataSource(df_plot)
dist = (np.array(['Battery Park City', 'Chelsea', 'Chinatown', 'Civic Center',
'East Harlem', 'East Village', 'Financial District',
'Flatiron District', 'Gramercy', 'Greenwich Village', 'Harlem',
"Hell's Kitchen", 'Inwood', 'Kips Bay', 'Little Italy',
'Lower East Side', 'Marble Hill', 'Midtown', 'Morningside Heights',
'Murray Hill', 'NoHo', 'Nolita', 'Roosevelt Island', 'SoHo',
'Stuyvesant Town', 'Theater District', 'Tribeca', 'Two Bridges',
'Upper East Side', 'Upper West Side', 'Washington Heights',
'West Village'])).tolist()
p = figure(height=450, width=900, x_range=dist, title='Avg. price for room types for all neighbourhoods')  # 'plot_height'/'plot_width' in Bokeh < 3
p.xaxis.axis_label = "Neighbourhood"
p.yaxis.axis_label = "Avg. price pr. night"
bar ={}
for indx, i, color in zip(np.arange(len(room_type)), room_type, cols):
    bar[i] = p.vbar(x='district', top=i, source=source,
                    muted_alpha=0.02, width=0.5, alpha=0.7, color=color)
items = []
for indx, i in enumerate(room_type):
    items.append((i, [bar[i]]))
legend = Legend(items=items, location=(5, 0))
p.add_layout(legend, 'right')
p.legend.click_policy="mute"
p.xaxis.major_label_orientation = math.pi/3
show(p)
Interesting points here are that Tribeca still has the most expensive entire homes/apartments, but the private rooms of Midtown and West Village are more expensive. With relatively few listings in the Financial District, the average price of a shared room is almost as much as an entire home/apt. This is unlikely and probably due to one or two shared rooms skewing the prices. There is lots more information - tinker with it yourself!
As a final step in the non-spatial exploration of the AirBnb data set, we will take a look at reviews. Reviews are always a good indicator of an AirBnb and probably one of the first things we look at when we have found a potential vacation home. We will make a boxplot of the number of reviews for some of the neighborhoods with the most listings.
most_listings = ['Harlem', 'Upper West Side', 'Hell\'s Kitchen', 'East Village', 'Upper East Side', 'Chelsea', 'West Village', 'Midtown', 'Tribeca']
df_small = df_manhattan[df_manhattan.neighbourhood.isin(most_listings)]
fig, ax = plt.subplots(figsize=(14, 8))
b2 = sns.boxplot(ax = ax, data = df_small[df_small['number_of_reviews']<100], x="neighbourhood", y="number_of_reviews",
palette = 'plasma', whis=[5, 95])
We set the whiskers to represent the 5th and 95th percentiles, meaning that 90% of the data lies inside the whiskers. While the medians are quite close to each other across neighborhoods, we can see a big difference in the interquartile ranges and the 95th percentiles. As it is generally good to read reviews before choosing where to stay, one can look for hosts especially in Harlem and also East Village/Hell's Kitchen, as there is a good chance they will have more reviews. Beware that many reviews can also be an indicator that a place is horrible. Keep that in mind and remember to read the reviews!
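The whisker setting (`whis=[5, 95]`) can be sanity-checked with `numpy.percentile`: by definition, roughly 90% of the observations fall between the 5th and 95th percentiles. A small sketch with synthetic review counts:

```python
import numpy as np

# Synthetic stand-in for the number_of_reviews column
rng = np.random.default_rng(42)
reviews = rng.integers(0, 100, size=1000)
# whis=[5, 95] places the whiskers at these two percentiles
lo, hi = np.percentile(reviews, [5, 95])
inside = np.mean((reviews >= lo) & (reviews <= hi))
print(round(float(inside), 2))  # close to 0.90 (ties in discrete data can push it slightly above)
```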
Some hosts have several listings, which are usually designed to be rented out and rarely used privately. They can be a good starting point if you are looking for AirBnbs with a hotel feeling to them. Let's see which hosts have the most listings in Manhattan.
## Identifying hosts with 10 most listings in Manhattan
top_host = df_manhattan.host_id.value_counts().head(10)
host_df = df_manhattan.set_index('host_id')
top_host_df = pd.DataFrame(top_host)
## Returning their rows from the original dataframe
top_host_count = host_df.loc[top_host.index]
## Each host id maps to exactly one host name, so unique() yields one name per host
top_host_df['name'] = host_df.loc[top_host.index].host_name.unique()
## Calculating mean price from hosts
mean_price = top_host_count.groupby('host_id').mean(numeric_only=True)
top_host_df['mean_price'] = mean_price['price']
top_host_df = top_host_df.set_index('name')
We'll visualize the total listings of the top hosts alongside their mean price pr. night.
fig, ax = plt.subplots(figsize=(14,6))
top_host_df['host_id'].plot(kind = 'bar', ax = ax, rot = 0, color=colors[3])
ax.set_xlabel('Host', fontsize=14)
ax.set_ylabel('Air Bnb listings', color = 'red', fontsize=14)
ax2=ax.twinx()
ax2.plot(top_host_df['mean_price'], 'ko-.')
ax2.set_ylabel("Mean price pr. night [$]", color="black", fontsize=14)
ax.grid(False)
ax2.grid(False)
plt.show()
If you are looking for a cheaper place to stay, listings by Mike might be a good option. For luxury, look to Blueground or Pranjal.
Now we will take a quick look at the other data sets which we would like to include in order for us to choose the optimal AirBnb. The data sets contain crime incidents, noise complaints and cafes/restaurants. From our travel experiences, these attributes can have a significant effect on the holiday. We also suspect that it can have an effect on the price of an AirBnb.
Below we load the data sets and assign them neighborhood names, as the originals did not have that attribute. These neighborhood names differ slightly from the ones in the AirBnb data set. Apparently, partitioning Manhattan into neighborhoods can be done in many ways and there is no definitive number of neighborhoods. The shp-file with neighborhoods that we found contains fewer neighborhoods than the AirBnb data set. To cope with this we merge some neighborhoods to get a total of 20 neighborhoods instead of 32.
neighborhood_names = list(nil['NTAName'])
neighborhoods = ['washingtonheights', 'uppereastside', 'upperwestside', 'eastharlem', 'harlem', 'midtown', 'hellskitchen',
'financial', 'morningheights', 'chelsea', 'westvillage', 'eastvillage', 'lowereast', 'murrayhill',
'morningheights', 'rooseveltisland', 'soho', 'chinatown', 'gramercy', 'stuyvesant']
num_cafe = []
num_crime = []
num_noise = []
cafe_frames = []
crime_frames = []
noise_frames = []
for i in range(len(neighborhoods)):
    place = neighborhoods[i]
    nca = pd.read_csv('crime/cafes_' + place + '.csv', encoding='unicode_escape')
    ncr = pd.read_csv('crime/crimes_' + place + '.csv')
    nno = pd.read_csv('crime/noise_' + place + '.csv')
    nca['neighbourhood'] = neighborhood_names[i]
    ncr['neighbourhood'] = neighborhood_names[i]
    nno['neighbourhood'] = neighborhood_names[i]
    cafe_frames.append(nca)
    crime_frames.append(ncr)
    noise_frames.append(nno)
    num_cafe.append(len(nca))
    num_crime.append(len(ncr))
    num_noise.append(len(nno))
## DataFrame.append was removed in pandas 2.0, so we collect the frames and concatenate once
cafes_df = pd.concat(cafe_frames, ignore_index=True)
crimes_df = pd.concat(crime_frames, ignore_index=True)
noise_df = pd.concat(noise_frames, ignore_index=True)
Below we show how many cafes/restaurants, crime incidents and noise complaints there are across the neighbourhoods.
## Counting numbers pr. neighbourhood
df_manhattan_cafe = cafes_df.groupby('neighbourhood').count()
df_manhattan_crime = crimes_df.groupby('neighbourhood').count()
df_manhattan_noise = noise_df.groupby('neighbourhood').count()
## Visualizing
fig, ax = plt.subplots(3,1,figsize=(14,20), sharex=True)
(df_manhattan_cafe['CAMIS']/nil['Shape_Area'].values*1e5).plot(kind = 'bar', ax = ax[0], rot = 90, color = colors[3])
ax[0].set_ylabel('No. of cafes/restaurants')
(df_manhattan_crime['field_1']/nil['Shape_Area'].values*1e5).plot(kind = 'bar', ax = ax[1], rot = 70, color = colors[4])
ax[1].set_ylabel('No. of crimes')
(df_manhattan_noise['Incident Zip']/nil['Shape_Area'].values*1e5).plot(kind = 'bar', ax = ax[2], rot = 90, color = colors[5])
ax[2].set(xlabel='Neighborhood',ylabel='No. of noise complaints')
plt.show()
Midtown and Theater District share a great number of cafes and restaurants, while Stuyvesant Town, Financial District and Battery Park City have very few according to our data set. Harlem and Washington Heights peak in both crime and noise complaints, making them perfect for the adventurous. East Village seems like a good place for the traveller who likes going out to eat but doesn't fancy getting robbed in the process. This is in accordance with the information we can find on the neighborhoods. No noise complaints were filed in Stuyvesant Town (!), hence it could be an ideal neighborhood if one appreciates silence.
We can gather some of the key information obtained from the analysis above in a spatial map partitioned by neighborhood in choropleth style. To this end, we use the folium package to display the shp-file with the neighborhoods on an interactive street map. We have added the key characteristics to the shp-file in QGIS, as we couldn't figure out how to do it in Python. Hover over the neighborhoods to reveal some key characteristics!
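For completeness, the attribute join we did in QGIS can also be done in Python, since a GeoDataFrame supports pandas-style `merge` on a key column. A sketch under assumptions: the `NTAName` key matches the shp-file, the stats frame is hypothetical, and a plain DataFrame stands in for `nil` so the snippet runs without the shapefile.

```python
import pandas as pd

# Stand-in for the GeoDataFrame; the real nil from gpd.read_file supports the same merge call
nil = pd.DataFrame({'NTAName': ['Harlem', 'Midtown', 'SoHo']})
# Hypothetical per-neighborhood stats computed earlier in the notebook
stats = pd.DataFrame({'NTAName': ['Harlem', 'Midtown'], 'Avg_Price': [115.0, 260.0]})
# Left join keeps every polygon; neighborhoods without stats get NaN
nil = nil.merge(stats, on='NTAName', how='left')
print(nil)
```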
## Creating basemap with the color grading dependent on price
mymap = folium.Map(location=[40.7788, -73.9660], zoom_start=12,tiles=None)
folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(mymap)
myscale = (nil['Avg_Price'].quantile((0,0.15,0.30,0.45,0.60,0.85, 1))).tolist()
## Map.choropleth is deprecated; the folium.Choropleth class is the current API
folium.Choropleth(
    geo_data=nil,
    name='Choropleth',
    data=nil,
    columns=['NTAName', 'Avg_Price'],
    key_on="feature.properties.NTAName",
    fill_color='YlGnBu',
    threshold_scale=myscale,
    fill_opacity=1,
    line_opacity=0.2,
    legend_name='Average rent price pr. night [$]',
    smooth_factor=0
).add_to(mymap)
style_function = lambda x: {'fillColor': '#ffffff',
'color':'#000000',
'fillOpacity': 0.1,
'weight': 0.1}
highlight_function = lambda x: {'fillColor': '#000000',
'color':'#000000',
'fillOpacity': 0.50,
'weight': 0.1}
## Adding the pop-up
NIL = folium.features.GeoJson(
nil,
style_function=style_function,
control=False,
highlight_function=highlight_function,
tooltip=folium.features.GeoJsonTooltip(
fields=['NTAName','Avg_Price', 'Density', 'crime_day', 'noise_dens', 'cafe_dens'],
aliases=['Neighborhood: ','Average price pr. night [$]: ', 'Air bnb pr. km2: ', 'Crime incident pr. year pr. km2: ',
'Noise complaints pr. km2: ', 'Cafes pr. km2: '],
style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;")
)
)
mymap.add_child(NIL)
mymap.keep_in_front(NIL)
folium.LayerControl().add_to(mymap)
mymap
The values can be a bit confusing as we have divided by the area of the neighborhood, but we deemed them more telling when given relative to size. Population could also be interesting, but that data was not available for all neighborhoods. It is clear that the more expensive neighborhoods are found south of Central Park. It appears that there are quite a few cheap options around Roosevelt Island, Washington Heights, Marble Hill and Inwood.
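The per-area normalization behind these numbers is simply a count divided by the polygon's Shape_Area, scaled by the same 1e5 factor as in the bar charts; a minimal sketch with made-up numbers (the area units depend on the shapefile's CRS):

```python
import pandas as pd

# Made-up counts and polygon areas for two neighborhoods
counts = pd.Series([1200, 300], index=['Harlem', 'SoHo'], name='crimes')
areas = pd.Series([7.6e7, 2.1e7], index=['Harlem', 'SoHo'], name='Shape_Area')
# Same normalization as the bar charts above: count / area, scaled by 1e5
density = counts / areas * 1e5
print(density.round(2))
```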
The previous map is great for creating an overview of the different neighborhoods. In the following plot we show the individual AirBnbs as they are spread out across Manhattan. Again, we will not go through all the code, but just give a few comments. Zoom in on the map to separate individual listings from the clusters and left-click to reveal name and price.
lats = list(df_manhattan.latitude)
lons = list(df_manhattan.longitude)
df_manhattan['name_and_price'] = df_manhattan['name'] + ' Price pr. night ' + df_manhattan['price'].astype('str') + '$'
df_manhattan = df_manhattan.reset_index()
from folium.plugins import MarkerCluster
locationlist = list(zip(df_manhattan['latitude'], df_manhattan['longitude']))
map2 = folium.Map(location=[40.7788, -73.9660], tiles='CartoDB positron', zoom_start=12)
home_layer = folium.FeatureGroup(name="Entire home/apt")
home_cluster = MarkerCluster().add_to(home_layer)
private_room_layer = folium.FeatureGroup(name="Private room")
private_room_cluster = MarkerCluster().add_to(private_room_layer)
shared_room_layer = folium.FeatureGroup(name="Shared room")
shared_room_cluster = MarkerCluster().add_to(shared_room_layer)
for point in range(len(locationlist)):
    if df_manhattan['room_type'][point] == 'Entire home/apt':
        folium.Marker(popup=df_manhattan['name_and_price'][point], location=locationlist[point],
                      icon=folium.Icon(color='red', icon='airbnb', prefix='fa')).add_to(home_cluster)
    if df_manhattan['room_type'][point] == 'Private room':
        folium.Marker(popup=df_manhattan['name_and_price'][point], location=locationlist[point],
                      icon=folium.Icon(icon='airbnb', prefix='fa', color='blue')).add_to(private_room_cluster)
    if df_manhattan['room_type'][point] == 'Shared room':
        folium.Marker(popup=df_manhattan['name_and_price'][point], location=locationlist[point],
                      icon=folium.Icon(icon='airbnb', prefix='fa', color='orange')).add_to(shared_room_cluster)
map2.add_child(home_layer)
map2.add_child(private_room_layer)
map2.add_child(shared_room_layer)
myscale = (nil['Avg_Price'].quantile((0,0.15,0.30,0.45,0.60,0.85, 1))).tolist()
## Map.choropleth is deprecated; the folium.Choropleth class is the current API
folium.Choropleth(
    geo_data=nil,
    name='Neighbourhood overview',
    data=nil,
    columns=['NTAName', 'Avg_Price'],
    key_on="feature.properties.NTAName",
    fill_color='YlGnBu',
    threshold_scale=myscale,
    fill_opacity=0.6,
    line_opacity=1,
    legend_name='Average rent price pr. night',
    smooth_factor=0
).add_to(map2)
map2.add_child(folium.map.LayerControl())
#map2.save("mapcluster_final1.html")
map2